library(tidyverse)  # data manipulation (loads dplyr, ggplot2, etc.)
library(factoextra) # clustering algorithms & visualization
library(caret)      # model training process
library(e1071)      # supports naiveBayes(), predict(), plot()
library(normalr)    # normalization of large datasets
library(fpc)        # flexible procedures for clustering
library(flexclust)  # k-centroids cluster analysis supporting arbitrary distance measures and centroid computation
library(ggfortify)  # plotting tools for statistical clustering with ggplot2
library(lattice)    # data visualization
library(pROC)       # ROC curve for the Naive Bayes model
library(klustR)     # pacoplot()
# stats ships with base R, so library(stats) is not needed
#install.packages("webshot")
#webshot::install_phantomjs() # resolves HTML-widget problems when knitting
#Loading Bath Soap data :
BathSoap <- read.csv("~/Downloads/BathSoap.csv")
#Checking NA values
sapply(BathSoap, function(x) sum(is.na(x))) ## No NA
1. No. of Brands
2. Brand Runs
3. Total Volume
4. No. of Transactions
5. Value
6. Trans / Brand Runs
7. Volume per Transaction
8. Average Price
9. Others999
10. Max Brand (explained below)
Since the CRISA marketing agency uses the data for general marketing purposes, a customer who is loyal to Brand 1 is the same as a customer who is loyal to Brand 2: both are equally loyal from the agency's perspective.
If we included every individual brand share in the data, clustering would treat them differently, whereas for general marketing analysis they should be treated the same.
Therefore, we create a variable holding the maximum of all the purchase shares.
#Creating new column Max_Brand: the maximum purchase share across the eight brand columns (appended as column 47)
BathSoap$Max_Brand <- apply(BathSoap[, 23:30], 1, max)
Purchase_Behaviour_df <- BathSoap[,c(12:19,31,47)] # the eight purchase-behaviour variables plus Others.999 and Max_Brand
str(Purchase_Behaviour_df)
## 'data.frame': 600 obs. of 10 variables:
## $ No..of.Brands : int 3 5 5 2 3 3 4 3 2 4 ...
## $ Brand.Runs : int 17 25 37 4 6 26 17 8 12 13 ...
## $ Total.Volume : int 8025 13975 23100 1500 8300 18175 9950 9300 26490 7455 ...
## $ No..of..Trans : int 24 40 63 4 13 41 26 25 27 18 ...
## $ Value : num 818 1682 1950 114 591 ...
## $ Trans...Brand.Runs: num 1.41 1.6 1.7 1 2.17 1.58 1.53 3.13 2.25 1.38 ...
## $ Vol.Tran : num 334 349 367 375 638 ...
## $ Avg..Price : num 10.19 12.03 8.44 7.6 7.12 ...
## $ Others.999 : chr "49.2%" "69.9%" "37.9%" "0.0%" ...
## $ Max_Brand : chr "38%" "8%" "55%" "60%" ...
#Converting percentage to numeric by removing percentage sign
Purchase_Behaviour_df[,c(9,10)] <- data.frame(sapply(Purchase_Behaviour_df[,c(9,10)], function(x) as.numeric(gsub("%", "", x))))
#normalizing values:
Purchase_Behaviour_norm <- sapply(Purchase_Behaviour_df, scale)
#Calculating pairwise distances for the normalized purchase-behaviour data
distance_Purchase_Behaviour_norm <- get_dist(Purchase_Behaviour_norm)
#Visualization of the distance matrix
fviz_dist(distance_Purchase_Behaviour_norm)
After analyzing the data, we can consider the variables below to understand the basis of a customer's purchase:
Promotion volume
Price: Pr.Cat.1, Pr.Cat.2, Pr.Cat.3, Pr.Cat.4
Selling propositions: PropCat.5, ..., PropCat.15
Analyzing Selling Propositions:
Analyzing Selling Propositions:
# Let's analyze the selling propositions PropCat.5 through PropCat.15
Selling_Prop<- BathSoap[,c(36:46)]
# Removing the % sign and converting to numeric
Selling_Prop <- data.frame(sapply(Selling_Prop, function(x) as.numeric(gsub("%", "", x))))
#Summary statistics for each variable
summary(Selling_Prop)
## PropCat.5 PropCat.6 PropCat.7 PropCat.8
## Min. : 0.00 Min. : 0.000 Min. : 0.000 Min. : 0.000
## 1st Qu.: 16.00 1st Qu.: 0.000 1st Qu.: 0.000 1st Qu.: 0.000
## Median : 44.00 Median : 2.000 Median : 1.000 Median : 1.000
## Mean : 45.72 Mean : 9.238 Mean : 9.688 Mean : 8.018
## 3rd Qu.: 72.00 3rd Qu.:10.000 3rd Qu.: 8.000 3rd Qu.: 9.000
## Max. :100.00 Max. :97.000 Max. :100.000 Max. :96.000
## PropCat.9 PropCat.10 PropCat.11 PropCat.12
## Min. : 0.000 Min. : 0.000 Min. : 0.000 Min. : 0.00
## 1st Qu.: 0.000 1st Qu.: 0.000 1st Qu.: 0.000 1st Qu.: 0.00
## Median : 0.000 Median : 0.000 Median : 0.000 Median : 0.00
## Mean : 3.085 Mean : 2.037 Mean : 2.942 Mean : 0.62
## 3rd Qu.: 3.000 3rd Qu.: 0.000 3rd Qu.: 1.000 3rd Qu.: 0.00
## Max. :41.000 Max. :100.000 Max. :90.000 Max. :33.00
## PropCat.13 PropCat.14 PropCat.15
## Min. : 0.000 Min. : 0.00 Min. : 0.000
## 1st Qu.: 0.000 1st Qu.: 0.00 1st Qu.: 0.000
## Median : 0.000 Median : 0.00 Median : 0.000
## Mean : 2.505 Mean : 13.65 Mean : 2.535
## 3rd Qu.: 1.000 3rd Qu.: 12.00 3rd Qu.: 0.000
## Max. :100.000 Max. :100.00 Max. :84.000
#Visualization
boxplot(Selling_Prop)
The boxplot and summary statistics for the selling propositions show that only Selling Proposition 5 (PropCat.5) received a significant response from customers: it has the highest percentage of households devoting more than 10% of their total purchase volume to it.
The remaining selling propositions each drew less than a 10% response.
Therefore, it is better to exclude the other selling-proposition categories and include only PropCat.5 in the model.
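The "more than 10% of purchase volume" claim above can be quantified directly: for each proposition column, compute the share of households whose percentage exceeds 10. A minimal base-R sketch on synthetic data (with the actual data, pass Selling_Prop instead of the toy frame):

```r
# Share of households whose purchase volume via each selling
# proposition exceeds 10% (toy stand-in for Selling_Prop).
set.seed(1)
toy <- data.frame(
  PropCat.5 = runif(600, 0, 100), # widely used proposition
  PropCat.9 = rexp(600, rate = 1) # rarely used proposition
)
response_rate <- colMeans(toy > 10) # fraction of rows above 10%
round(response_rate, 2)
```

Columns with a response rate near zero are the candidates to drop before clustering.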
Basis_Purchase_df<-BathSoap[,c(20:22,32:36)]
Basis_Purchase_df <- data.frame(sapply(Basis_Purchase_df, function(x) as.numeric(gsub("%", "", x))))
str(Basis_Purchase_df)
## 'data.frame': 600 obs. of 8 variables:
## $ Pur.Vol.No.Promo.... : num 100 89 94 100 61 100 98 94 90 100 ...
## $ Pur.Vol.Promo.6.. : num 0 10 2 0 14 0 2 0 10 0 ...
## $ Pur.Vol.Other.Promo..: num 0 2 4 0 24 0 0 6 0 0 ...
## $ Pr.Cat.1 : num 23 29 12 0 0 22 7 4 11 61 ...
## $ Pr.Cat.2 : num 56 55 32 40 5 45 66 4 89 10 ...
## $ Pr.Cat.3 : num 13 9 56 60 14 7 5 90 0 12 ...
## $ Pr.Cat.4 : num 7 6 0 0 81 27 23 2 0 17 ...
## $ PropCat.5 : num 50 46 24 40 81 49 82 6 70 24 ...
class(Basis_Purchase_df)
## [1] "data.frame"
#Normalization
Basis_Purchase_norm <- sapply(Basis_Purchase_df, scale)
#Calculating pairwise distances for the normalized basis-of-purchase data
distance_Basis_Purchase_norm <- get_dist(Basis_Purchase_norm)
#Visualization of the distance
fviz_dist(distance_Basis_Purchase_norm)
#Including all the variables from Purchase behaviour and basis of purchase
Basis_Behaviour_Purchase_df<-cbind.data.frame(Purchase_Behaviour_df,Basis_Purchase_df)
#Normalization of the data
Basis_Behaviour_Purchase_norm <- sapply(Basis_Behaviour_Purchase_df, scale)
#Calculating pairwise distances for the normalized combined data
distance_Basis_Behaviour_Purchase_norm <- get_dist(Basis_Behaviour_Purchase_norm)
#Visualization of the distance
fviz_dist(distance_Basis_Behaviour_Purchase_norm)
K should be chosen to minimize intra-cluster distance and maximize the distance between clusters. Since the marketing efforts are expected to support at least 2-5 different promotional approaches, I keep the range for k within 2-5.
Clusters should also be interpretable and actionable, which limits their number: with too many clusters we lose the ability to interpret them, while too few risks over-generalizing, creating a simplistic treatment and missing the opportunity for a more tailored and effective approach.
To determine the optimal value of k, we use the silhouette method and Dunn's index for each of the variable sets:
#Creating variables P1, P2, P3 to store fviz_nbclust() output for each variable set:
P1<-fviz_nbclust(Purchase_Behaviour_norm, kmeans, method = "silhouette",k.max=5) + ggtitle("Purchase Behaviour")
P2<-fviz_nbclust(Basis_Purchase_norm, kmeans, method = "silhouette",k.max=5) + ggtitle("Basis for Purchase")
P3<-fviz_nbclust(Basis_Behaviour_Purchase_norm, kmeans, method = "silhouette",k.max=5)+ ggtitle("Basis for purchase + Purchase Behaviour")
#Plotting the silhouette results for the three variable sets side by side:
gridExtra::grid.arrange(P1, P2,P3, nrow = 1)
Result: per the silhouette method, the optimal values are: Purchase Behaviour, k = 2; Basis of Purchase, k = 4; and the combination of both, k = 5.
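Dunn's index was named above as a second criterion but never computed. It is the smallest between-cluster distance divided by the largest within-cluster diameter (higher is better); fpc::cluster.stats() reports the same quantity as $dunn. A hand-rolled base-R sketch on a toy matrix standing in for the normalized data:

```r
# Dunn's index: min inter-cluster distance / max intra-cluster diameter.
dunn_index <- function(x, cluster) {
  d <- as.matrix(dist(x))
  same <- outer(cluster, cluster, "==") # TRUE for pairs in one cluster
  diag(d) <- NA                         # ignore self-distances
  min(d[!same], na.rm = TRUE) / max(d[same], na.rm = TRUE)
}
set.seed(123)
toy <- rbind(matrix(rnorm(40), ncol = 2),           # group near (0, 0)
             matrix(rnorm(40, mean = 6), ncol = 2)) # group near (6, 6)
km <- kmeans(toy, centers = 2)
dunn_index(toy, km$cluster)
```

Comparing this value across k = 2..5 on each normalized matrix gives a second vote alongside the silhouette plots.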
pacoplot() creates an interactive parallel-coordinates plot detailing each dimension and the cluster associated with each observation.
Reference: https://www.rdocumentation.org/packages/klustR/versions/0.1.0/topics/pacoplot Note: I have used pacoplot for the cluster-associated observations; however, since its outputs are HTML widgets, they do not appear in the knitted document.
#When K=2,
set.seed(123)
kmeans2_Purchase_Behaviour <- kmeans(Purchase_Behaviour_norm, centers = 2) #kmeans for Purchase Behaviour
kmeans2_Purchase_Behaviour$size
## [1] 393 207
fviz_cluster(kmeans2_Purchase_Behaviour, data = Purchase_Behaviour_norm)
kmeans2_Purchase_Behaviour$withinss
## [1] 2646.517 2078.473
kmeans2_Purchase_Behaviour$betweenss
## [1] 1265.009
#Visualization of the data to understand the features of the clusters within each segment:
pacoplot(data = Purchase_Behaviour_norm, clusters = kmeans2_Purchase_Behaviour$cluster)
size: 393, 207
withinss: 2646.517, 2078.473
betweenss: 1265.009
Cluster 1 (orange): buys heavily from the Others999 brands and has the highest number of brands and brand runs, but its transaction volume is low; it is therefore lowest in brand loyalty.
Cluster 2: highest in brand loyalty, with the highest transaction volume and transactions per brand run.
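The readings above come from the parallel-coordinates plot; a compact numeric complement is the mean of every unscaled variable per cluster. A sketch on synthetic data (with the real data, aggregate Purchase_Behaviour_df by kmeans2_Purchase_Behaviour$cluster):

```r
# Mean of each original (unscaled) variable per cluster, so the
# profile is readable in the variables' own units.
set.seed(42)
toy <- data.frame(Brand.Runs   = rpois(100, 15),
                  Total.Volume = rpois(100, 9000))
km <- kmeans(scale(toy), centers = 2)
profile <- aggregate(toy, by = list(cluster = km$cluster), FUN = mean)
profile
```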
Let's find the k-means clusters for "Basis for Purchase" with k = 4:
#When K=4
set.seed(789)
kmeans4_Basis_Purchase<- kmeans(Basis_Purchase_norm, centers = 4)
fviz_cluster(kmeans4_Basis_Purchase, data = Basis_Purchase_norm)
#size
kmeans4_Basis_Purchase$size
## [1] 74 320 19 187
#withinss
kmeans4_Basis_Purchase$withinss
## [1] 553.9459 1108.6059 292.4255 1007.2238
#betweenss
kmeans4_Basis_Purchase$betweenss
## [1] 1829.799
#Cluster Visualization
pacoplot(data = Basis_Purchase_norm, clusters = kmeans4_Basis_Purchase$cluster)
Cluster size: 74, 320, 19, 187
withinss: 553.9459, 1108.6059, 292.4255, 1007.2238
betweenss: 1829.799
Basis of Purchase :
Cluster 1: customers purchasing without responding to promotions, responding to price categories 1 and 2 and to the selling propositions.
Cluster 2: customers not responding to promotions, responding to the chosen selling proposition and price categories 2 and 4.
Cluster 3: customers neutral towards promotions, responding to the selling propositions and price categories 1 and 2, but not to price category 3.
Cluster 4: customers purchasing without any promotions, but responding to price category 2 and the selling propositions.
set.seed(666)
kmeans5_Basis_Behaviour_Purchase <- kmeans(Basis_Behaviour_Purchase_norm, centers = 5)
fviz_cluster(kmeans5_Basis_Behaviour_Purchase, data = Basis_Behaviour_Purchase_norm)
#size
kmeans5_Basis_Behaviour_Purchase$size
## [1] 132 175 66 172 55
#withinss
kmeans5_Basis_Behaviour_Purchase$withinss
## [1] 1639.2775 1924.4079 747.9557 1727.6913 603.6403
#betweenss
kmeans5_Basis_Behaviour_Purchase$betweenss
## [1] 4139.027
#Visualization
pacoplot(data = Basis_Behaviour_Purchase_norm, clusters = kmeans5_Basis_Behaviour_Purchase$cluster,labelSizes=list(yaxis=6, yticks = 10, tooltip = 15))
Cluster size: 132, 175, 66, 172, 55
withinss: 1639.2775, 1924.4079, 747.9557, 1727.6913, 603.6403
betweenss: 4139.027
Cluster 1: customers showing high brand loyalty mixed with neutrality; we can call this a "grey" cluster. They do not respond to promotional offers but do respond to selling propositions.
Cluster 2: customers not loyal to the brands, purchasing from many different brands with high transaction volume and value; interestingly, they do not respond to promotional offers but respond strongly to price categories 1 and 2.
Cluster 3 (green): brand-loyal customers with high transactions per brand run; they do not respond to promotional offers and shop in price category 3.
Cluster 4 (red): non-brand-loyal customers purchasing other brands in high volume and responding to price category 2 and to promotional offers.
Cluster 5: non-brand-loyal customers purchasing many different brands and responding to promotions, price category 1, and the chosen selling proposition.
I believe the segmentation built from both Purchase Behaviour and Basis of Purchase should be considered the best.
Reason:
Having more data is always a good idea, especially when the client is looking for a larger number of promotional approaches. For example, Purchase Behaviour alone yields the best cluster statistics, but it tells us only about loyal and non-loyal customers and their trends; if we also understand their basis of purchase, better strategies can be devised to capture customers' attention.
CRISA is a marketing agency and owns the data, which it collected at considerable expense, so it will want to be able to use both the data and the segmentation analysis in different ways for different clients.
Let's understand the characteristics of the combined Purchase Behaviour and Basis of Purchase data through the demographics. Since we have already analyzed and interpreted the clusters on purchase behaviour and basis of purchase, here we focus on the demographics and interpret them in light of the earlier analysis of brand loyalty and basis of purchase.
Basis_Behaviour_Purchase_df2<-cbind.data.frame(Basis_Behaviour_Purchase_df,BathSoap[,c(2:11)])
Basis_Behaviour_Purchase_df2<- as.matrix(Basis_Behaviour_Purchase_df2)
#Adding new column 'cluster' to mention the cluster no. in dataset
Basis_Behaviour_Purchase_df2 <- data.frame(Basis_Behaviour_Purchase_df2,
cluster = as.factor(kmeans5_Basis_Behaviour_Purchase$cluster))
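One subtlety in the chunk above: the as.matrix() round-trip coerces every column of the mixed data frame to character, which is one reason the demographic columns must be re-factored before modelling. A minimal illustration on a toy frame with illustrative names:

```r
# as.matrix() on a mixed data frame yields a character matrix;
# wrapping it back in data.frame() leaves character columns (R >= 4.0).
df <- data.frame(x = c(1.5, 2.5), y = c("a", "b"))
round_trip <- data.frame(as.matrix(df))
class(df$x)          # numeric
class(round_trip$x)  # character (R >= 4.0)
```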
table(kmeans5_Basis_Behaviour_Purchase$cluster)
##
## 1 2 3 4 5
## 132 175 66 172 55
##Gender
barplot(table(Basis_Behaviour_Purchase_df2$SEX,Basis_Behaviour_Purchase_df2$cluster),
main="Gender",
xlab="Clusters",
ylab="Count of people",
col=c("darkblue","red","yellow"),
legend=rownames(table(Basis_Behaviour_Purchase_df2$SEX,Basis_Behaviour_Purchase_df2$cluster)))
Interpretation: females make the most purchases in every cluster, regardless of whether the cluster is brand loyal or responds to selling propositions and offers.
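The stacked bars show raw counts; converting them to within-cluster shares with prop.table() makes a claim like "mostly female" precise. A sketch on synthetic labels (with the real data, use table(Basis_Behaviour_Purchase_df2$SEX, Basis_Behaviour_Purchase_df2$cluster)):

```r
# Within-cluster composition: margin = 2 makes each column sum to 1.
set.seed(7)
sex     <- sample(c("F", "M"), 200, replace = TRUE, prob = c(0.8, 0.2))
cluster <- sample(1:5, 200, replace = TRUE)
shares  <- prop.table(table(sex, cluster), margin = 2)
round(shares, 2)
```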
##Age
barplot(table(Basis_Behaviour_Purchase_df2$AGE,Basis_Behaviour_Purchase_df2$cluster),
main="Age",
xlab="Clusters",
ylab="Count of people",
col=c("darkblue","red","yellow","pink"),
legend=rownames(table(Basis_Behaviour_Purchase_df2$AGE,Basis_Behaviour_Purchase_df2$cluster)))
Interpretation: age group 4 is the largest in every cluster; its members therefore make most of the purchases.
##Socio economic
barplot(table(Basis_Behaviour_Purchase_df2$SEC,Basis_Behaviour_Purchase_df2$cluster),
main="Socio economic",
xlab="Clusters",
ylab="Count of people",
col=rainbow(10),
legend=rownames(table(Basis_Behaviour_Purchase_df2$SEC,Basis_Behaviour_Purchase_df2$cluster)))
Interpretation: Clusters 2, 3, 4 and 5 contain customers with high socio-economic status, while Cluster 1 has low socio-economic status.
## Affluence Index
barplot(table(Basis_Behaviour_Purchase_df2$Affluence.Index,Basis_Behaviour_Purchase_df2$cluster),
main=" Affluence Index",
xlab="Clusters",
ylab="Count of people",
col=rainbow(10),
legend=rownames(table(Basis_Behaviour_Purchase_df2$Affluence.Index,Basis_Behaviour_Purchase_df2$cluster)))
Interpretation:
The chart shows no clear trend or pattern in the Affluence Index across the clusters; the stacked bars simply look like a rainbow.
## Education
barplot(table(Basis_Behaviour_Purchase_df2$EDU,Basis_Behaviour_Purchase_df2$cluster),
main="Education",
xlab="Clusters",
ylab="Count of people",
col=rainbow(12),
legend=rownames(table(Basis_Behaviour_Purchase_df2$EDU,Basis_Behaviour_Purchase_df2$cluster)))
Interpretation: people with education level 5 (college graduate) tend to purchase more in clusters 1, 2 and 4.
## Mother Tongue
barplot(table(Basis_Behaviour_Purchase_df2$MT,Basis_Behaviour_Purchase_df2$cluster),
main="Mother Tongue",
xlab="Clusters",
ylab="Count of people",
col=rainbow(10),
legend=rownames(table(Basis_Behaviour_Purchase_df2$MT,Basis_Behaviour_Purchase_df2$cluster)))
Interpretation: language 10 dominates in every cluster; perhaps most of the cities covered in the survey were in the same regional state of India.
## Number of Members in a household
barplot(table(Basis_Behaviour_Purchase_df2$HS,Basis_Behaviour_Purchase_df2$cluster),
main="Number of members in household",
xlab="Clusters",
ylab="Count of people",
col=rainbow(10),
legend=rownames(table(Basis_Behaviour_Purchase_df2$HS,Basis_Behaviour_Purchase_df2$cluster)))
Interpretation: Clusters 1 and 4 have an average household size of 3-4 members, who do most of the shopping; however, a family size of 5 is dominant across all clusters.
## Eating Habits
barplot(table(Basis_Behaviour_Purchase_df2$FEH,Basis_Behaviour_Purchase_df2$cluster),
main="Eating Habits",
xlab="Clusters",
ylab="Count of people",
col=rainbow(10),
legend=rownames(table(Basis_Behaviour_Purchase_df2$FEH,Basis_Behaviour_Purchase_df2$cluster)))
Interpretation: across all clusters, a large share of the purchasing households are non-vegetarian.
## Availability of TV
barplot(table(Basis_Behaviour_Purchase_df2$CS,Basis_Behaviour_Purchase_df2$cluster),
main="Availability of TV",
xlab="Clusters",
ylab="Count of people",
col=c("darkblue","red","yellow"),
legend=rownames(table(Basis_Behaviour_Purchase_df2$CS,Basis_Behaviour_Purchase_df2$cluster)))
Interpretation: TV availability is high in every cluster; most customers have a TV according to the data collected by CRISA.
## Number of Children
barplot(table(Basis_Behaviour_Purchase_df2$CHILD,Basis_Behaviour_Purchase_df2$cluster),
main=" Number of children",
xlab="Clusters",
ylab="Count of people",
col=rainbow(10),
legend=rownames(table(Basis_Behaviour_Purchase_df2$CHILD,Basis_Behaviour_Purchase_df2$cluster)))
Interpretation: households with 4-5 children make the most purchases across all clusters.
1. Most consumers are female, so most of the ads should target women.
2. Most customers are not loyal; they buy value-added packs.
3. Most customers have a TV, so advertisements can be broadcast.
4. The client should promote its brands with gift coupons or exchange offers.
Cluster 1:
This cluster shows high brand loyalty mixed with neutrality (a neutral cluster); it does not respond to promotional offers but does respond to selling propositions.
Demographically, it is the lowest socio-economic group, and the majority of its members have little education or are college pass-outs.
Cluster 2:
Again, these customers are not loyal to the brands; interestingly, they do not respond to promotional offers but respond strongly to price category 1.
Demographically, it is an upper-middle socio-economic group with families of 3-4 members, a large majority of women, most of them educated.
Cluster 3:
Brand-loyal customers with high transactions per brand run; they do not respond to promotional offers and shop in price category 3.
Demographically, it is a high to upper-middle socio-economic group, with mostly basic education (levels 3-4).
Cluster 4:
Non-brand-loyal customers purchasing other brands in high volume and responding to price category 2.
Demographically, it is a high socio-economic group with 4 household members and a basic education level.
Cluster 5:
Non-brand-loyal customers purchasing many different brands and responding to promotions, price category 1, and the chosen selling proposition.
Demographically, it is mostly a high socio-economic class, and the customers are largely college students or in higher studies.
Choosing a cluster (market segment) that would be defined as success: the cluster profiles differ chiefly by brand loyalty and, secondly, by socio-economic status.
Multiple promotional approaches are possible for these customer segments. For example:
1. Focusing on brand-loyal customers with high socio-economic status: based on their purchase behaviour, the client can target these customers with an approach customized for them.
2. Focusing on non-brand-loyal customers with high socio-economic status: this group gives the client a great opportunity to grow its business by making wise decisions on promotions.
Therefore, I choose Cluster 4: a non-brand-loyal customer group with high socio-economic status and high transaction volume that responds to promotional offers and selling propositions; the majority of its members are women and educated.
#Creating a variable for the Combination dataset with demographics
BathSoap_Model<-Basis_Behaviour_Purchase_df2
# The selected cluster is 4. Recode the cluster labels to 1/0 so that cluster 4 can be predicted as the positive class.
BathSoap_Model$cluster=ifelse(BathSoap_Model$cluster=="4",1,0)
#Converting the class label and the demographic variables to factors
factor_cols <- c("cluster", "SEC", "FEH", "MT", "SEX", "AGE",
                 "EDU", "HS", "CHILD", "CS", "Affluence.Index")
BathSoap_Model[factor_cols] <- lapply(BathSoap_Model[factor_cols], as.factor)
set.seed(777)
#Partitioning the dataset to build a model and predict on the validation dataset
partition<- createDataPartition(BathSoap_Model$No..of.Brands,p=0.6,list=FALSE)
train_data<- BathSoap_Model[partition,]
validation_data<- BathSoap_Model[-partition,]
#Checking the count of the partitioned data
nrow(validation_data)
## [1] 238
nrow(train_data)
## [1] 362
Naive Bayes is easy and fast at predicting the class of a test data set. It performs well with categorical input variables compared to numerical ones, and our dataset contains demographic (categorical) data.
So let's try this model and check the results.
set.seed(3689)
# Building Naive Bayes Model
nb_model<-naiveBayes(cluster~., data=train_data)
# Prediction
Predicted_Test_labels <-predict(nb_model,validation_data)
validation_data<-as.data.frame(validation_data)
# Show the confusion matrix of the classifier
confusionMatrix(validation_data$cluster,Predicted_Test_labels)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 143 26
## 1 9 60
##
## Accuracy : 0.8529
## 95% CI : (0.8015, 0.8954)
## No Information Rate : 0.6387
## P-Value [Acc > NIR] : 1.494e-13
##
## Kappa : 0.6671
##
## Mcnemar's Test P-Value : 0.006841
##
## Sensitivity : 0.9408
## Specificity : 0.6977
## Pos Pred Value : 0.8462
## Neg Pred Value : 0.8696
## Prevalence : 0.6387
## Detection Rate : 0.6008
## Detection Prevalence : 0.7101
## Balanced Accuracy : 0.8192
##
## 'Positive' Class : 0
##
1. Accuracy is 85.3%, with a sensitivity of 94.1% and a specificity of 69.8%. The misclassification counts are 26 (predicted 0, actual 1) and 9 (predicted 1, actual 0).
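The misclassification counts quoted above are simply the off-diagonal cells of the confusion matrix. A toy base-R illustration of reading them off (with the real output, these are the cells of confusionMatrix(...)$table):

```r
# False positives and false negatives from a 2x2 confusion table.
actual    <- factor(c(0, 0, 0, 1, 1, 0, 1, 0), levels = c(0, 1))
predicted <- factor(c(0, 1, 0, 1, 0, 0, 1, 0), levels = c(0, 1))
cm <- table(Prediction = predicted, Reference = actual)
false_pos <- cm["1", "0"] # predicted 1, actually 0
false_neg <- cm["0", "1"] # predicted 0, actually 1
c(FP = false_pos, FN = false_neg)
```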
An ROC curve is a graph showing the performance of a classification model at all classification thresholds. It plots two parameters: the true positive rate against the false positive rate.
Let's compute the ROC curve:
set.seed(123)
table(validation_data$cluster)
##
## 0 1
## 169 69
# ROC curve
Predicted_Test_2labels <-predict(nb_model,validation_data, type = "raw")
table(Predicted_Test_2labels==1)
##
## FALSE TRUE
## 438 38
roc(validation_data$cluster, Predicted_Test_2labels[,2])
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
##
## Call:
## roc.default(response = validation_data$cluster, predictor = Predicted_Test_2labels[, 2])
##
## Data: Predicted_Test_2labels[, 2] in 169 controls (validation_data$cluster 0) < 69 cases (validation_data$cluster 1).
## Area under the curve: 0.9497
plot.roc(validation_data$cluster,Predicted_Test_2labels[,2])
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
Therefore, based on the results we can say that the Naive Bayes model is effective in classifying the data. The model is not 100% accurate; it achieves an accuracy of about 85% with an AUC of 0.95. The client can use the model to target customers belonging to the chosen segment.